Extract Content (Web Mining)
Synopsis
Extracts content from an HTML document.
Description
This operator extracts textual content from a given HTML document and returns the extracted text blocks as documents. Only text blocks containing at least a given number of words are extracted, which prevents single words (e.g. in navigation bars) from being kept.
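The length filter described above can be illustrated with a short Python sketch. It is not the operator's actual implementation: the function name `keep_long_blocks`, the default threshold of 5, whitespace tokenization, and the inclusive comparison are all assumptions made for illustration.

```python
def keep_long_blocks(blocks, minimum_text_block_length=5):
    """Keep only text blocks with at least the given number of tokens.

    Assumptions (not taken from the operator itself): tokens are split on
    whitespace, and the threshold is inclusive.
    """
    return [
        block.strip()
        for block in blocks
        if len(block.split()) >= minimum_text_block_length
    ]


if __name__ == "__main__":
    candidates = [
        "Home",                                                # navigation entry
        "About us",                                            # navigation entry
        "This operator extracts textual content from HTML.",   # real content
    ]
    print(keep_long_blocks(candidates))
```

With this threshold, the two short navigation entries are dropped and only the full sentence remains.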
Input
- document
The document port.
Output
- document
The document port.
Parameters
- extract content
Specifies whether content is extracted or not.
- minimum text block length
The minimum length (in words/tokens) of text blocks.
- override content type information
Specifies whether any existing content type and encoding information should be overridden using the HTML meta http-equiv tag.
- neglegt span tags
Specifies whether <span> tags should be ignored or used as text block dividers.
- neglect p tags
Specifies whether <p> tags should be ignored or used as text block dividers.
- neglect b tags
Specifies whether <b> tags should be ignored or used as text block dividers.
- neglect i tags
Specifies whether <i> tags should be ignored or used as text block dividers.
- neglect br tags
Specifies whether <br> tags should be ignored or used as text block dividers.
- ignore non html tags
Specifies whether tags that are not common HTML should be ignored.
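The difference between neglecting a tag and using it as a text block divider can be sketched with Python's standard `html.parser` module. This is an illustration only, not the operator's code: the class `BlockSplitter`, the function `split_blocks`, and the `neglected_tags` argument are hypothetical names, and the sketch makes the simplifying assumption that every tag that is not neglected acts as a divider.

```python
from html.parser import HTMLParser


class BlockSplitter(HTMLParser):
    """Collects text blocks; any tag not in neglected_tags ends the current block."""

    def __init__(self, neglected_tags=()):
        super().__init__()
        self.neglected_tags = set(neglected_tags)
        self.blocks = []
        self._current = []

    def flush(self):
        # Join buffered text, collapse whitespace, and store non-empty blocks.
        text = " ".join("".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.neglected_tags:
            self.flush()

    def handle_endtag(self, tag):
        if tag not in self.neglected_tags:
            self.flush()

    def handle_data(self, data):
        self._current.append(data)


def split_blocks(html, neglected_tags=()):
    """Return the text blocks of an HTML string (hypothetical helper)."""
    parser = BlockSplitter(neglected_tags)
    parser.feed(html)
    parser.close()   # emits any text still buffered by the parser
    parser.flush()   # stores the final block, if any
    return parser.blocks


if __name__ == "__main__":
    html = "<p>Hello <b>big</b> world</p>"
    print(split_blocks(html, neglected_tags={"b"}))  # <b> neglected
    print(split_blocks(html))                        # <b> used as divider
```

When `<b>` is neglected, the sentence stays in one block; when it is used as a divider, the text is split into three short blocks, which a minimum text block length filter would then likely discard.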